Abstract

Background: Accurate genetic maps are the cornerstones of genetic discovery, but their construction can be 
hampered by missing parental genotype information. Inference of parental haplotypes and correction of phase 
errors can be done manually on a one-by-one basis with the aid of current software tools, but this is tedious and 
time-consuming for the high marker density datasets currently being generated for many crop species. Tools that 
help automate the process of inferring parental genotypes can greatly speed the process of map building. We 
developed a software tool that infers and outputs missing parental genotype information based on observed 
patterns of segregation in mapping populations. When phases are correctly inferred, they can be fed back to the 
mapping software to quickly improve marker order and placement on genetic maps. 

Results: ParentChecker is a user-friendly tool that uses the segregation patterns of progeny to infer missing 
genotype information of parental lines that have been used to construct a mapping population. It can also be 
used to automate correction of linkage phase errors in genotypic data that are in ABH format. 

Conclusion: ParentChecker efficiently improves genetic mapping datasets for cases where parental information is 
incomplete by automating the process of inferring missing parental genotypes in inbred mapping populations and can also 
be used to correct linkage phase errors in ABH-formatted datasets. 
Background

Lack of knowledge of the parental phase of all alleles 
segregating in mapping populations can impinge on the 
accuracy of genetic maps. Recombinant inbred line 
(RIL) populations developed from two inbred lines are a 
powerful resource for construction of genetic linkage 
maps. However, it is not uncommon to observe segrega- 
tion of markers in RILs that are observed to be fixed in 
the putative inbred parents of the RIL, and conversely, 
to observe markers that are polymorphic in the two RIL 
parents, but fixed in the RIL population. This indicates 
that the real parents used in the cross to develop the 
RIL population differ from the available "off-parents". 
This situation probably has two primary causes: 1) one 
or both parents were not completely inbred at the time 
the population was initiated, or 2) residual genetic 
variation exists within one or both parental lines. This 
observation is not surprising 
given that ten or more years can pass between the time 
when a RIL population is initiated with a cross between 
two parent plants and the time when it is genotyped 
along with the presumed parental lines. In both 
scenarios, one plant of an inbred line was used for the 
initial hybrid, while another closely related plant of the 
same inbred line was used for genotyping years later. 
Thus for case 1) where the original parent plant was 
heterozygous (Aa) at some fraction of its genome at the 
time of crossing and then subsequently maintained by 
inbreeding, the current (more inbred) version of the 
'parent' line will have become fixed randomly for either 
AA or aa, causing the 'unexpected' segregation in the 
RIL half of the time. For case 2, it is not hard to envi- 
sage the existence of limited genotypic differences 
among individuals within an inbred crop line or variety 
because it has been standard practice to produce foun- 
dation seed-stocks of new cultivars from 'headrow' bulks 
of 'on-type' highly inbred sublines [1]. Residual genetic 
variation in homozygous form will be captured in the 
bulk constituting the Breeder's Seed of such cultivars 
that can then manifest itself in genetic differences 
between an individual selected as a parent for RIL popu- 
lation development and another individual of the same 
line or cultivar that is genotyped. 

In other cases, the original parental seed source used to 
make the RIL population may have been lost as a result 
of error or project discontinuity, such as personnel 
changes, which may further complicate the identity of 
the real parent(s). The problem for the production of a 
genetic map is that it is advantageous to know the paren- 
tal phase of all alleles, but the "off-parent" genotypes can- 
not be used to infer the allele phase of every marker. This 
"off-parent" problem is most severe when the alleles of 
both parent stocks are opposite from the alleles in the 
actual parents of the initial F1 plant. However, as long as 
the genotyped parental stocks are genetically very similar 
to the actual parents, enough information resides in the 
mapping population to correctly infer the haplotype 
composition of the actual parents. 

Prior to the advent of high density genotyping, lack of 
marker coverage limited the prospects to detect cases of 
the off-parent problem and to correctly infer the actual 
parent. High-density genotyping greatly increases the 
opportunity to observe the off-parent problem and 
enables the inference of actual parental genotypes [2,3]. 
The increased number of inferences needed with high 
density genotyping data sets speaks to the need for tools 
that automate the process of parental inference. 

Here we present a new software package, ParentCh- 
ecker, that addresses two common needs in the prepara- 
tion of genotyping data for mapping with inbred 
populations in plant species: 1) inference of the actual 
parental haplotype, which is relevant to biallelic or 
ACGT format datasets, and 2) automatic correction of 
the phase of markers in individuals in the mapping 
population if the markers are expressed in biallelic for- 
mat and the parental genotypes are unknown. 



Implementation 

The current version of ParentChecker was developed 
to handle single-nucleotide polymorphism (SNP) data 
(in ACGT format). However, it also works for other 
co-dominant markers that are coded in A, B, H or AA, 
AB, BB format. ParentChecker is efficient in both memory 
use and computational speed. On a desktop computer with 
a 2.0 GHz CPU and 2 GB of RAM, ParentChecker needs only 
a few seconds to process genetic data from a relatively 
large segregating population (e.g., 500 individuals with 
1000 SNPs). The algorithms implemented by ParentChecker 
to infer the unknown parental genotypes and linkage 
phase are as follows: 

Parental genotype inference 

Parents used to derive inbred mapping populations are 
usually assumed to be pure lines. In practice, the parents 
are often heterozygous at a limited number of loci. 
Table 1 shows three types of gene transmission patterns 
for a polymorphic marker when a RIL population is 
derived. Initially, most loci are heterozygous for both 
parents. The segregation ratio for genotypes AA:Aa:aa is 
1/4:1/2:1/4 for an F2 population. The ratio becomes 
3/8:2/8:3/8 for an F3. For each additional generation of 
selfing, the proportion of heterozygotes is reduced by 
half, and the lost half is divided equally between the 
two homozygotes. Therefore, the theoretical ratio 
between the two homozygotes is always 1:1. However, 
the two homozygotes are not expected in a 1:1 ratio 
when the initial cross is made between a homozygote 
and a heterozygote. Therefore, a χ² test can be used to 
determine whether the observed proportions of 
homozygous individuals differ significantly from 1:1, 
and thus to infer the cross type of the parents (e.g., 
whether it was AA × aa or Aa × aa), by comparing the 
observed genotype proportions with the theoretical 
values listed in Table 1. When a small population is 
obtained in an advanced generation, the decreasing 
proportion of heterozygotes biases the statistics. 
Therefore, a special algorithm is needed to adjust for 
this bias. In ParentChecker, two statistical tests are used 
to infer the parental genotype: (a) calculating the test 
statistic for the ratio of the two homozygotes against the 
theoretical ratio of 1:1, which can be calculated by 

$$\chi^2 = \frac{(P_{AA} - P_{aa})^2}{P_{AA} + P_{aa}} \sim \chi^2_{\nu=1};$$

(b) calculating the statistic for the ratio of the major 
homozygote to the sum of the other two genotypes 
against the theoretical ratio defined as 

$$P_{\text{homozygous1}} : P_{\text{homozygous2 + heterozygotes}} = \left(\frac{3}{4} - \frac{1}{2^{n+1}}\right) : \left(\frac{1}{4} + \frac{1}{2^{n+1}}\right) \qquad (1)$$

Table 1 Theoretical proportions of genotypes in segregating populations generated over 1, 2, 3 and n generations of 

where the major homozygote is defined as the 
homozygote with frequency higher than that of the other 
homozygote. The test statistic is calculated as 



where $N$ is the population size, and $O_{\text{homozygous1}}$ and 
$O_{\text{homozygous2 + heterozygotes}}$ are the observed frequencies of the 
major homozygote and of the other two genotypes 
combined, respectively. ParentChecker assigns the cross 
type with the smallest statistic value. If (a) is accepted, 
the segregating population is assumed to be derived from 
homozygous parental genotypes; otherwise, the initial 
cross is assumed to have been made between a 
homozygote and a heterozygote. Although a cross between 
two heterozygotes can also produce the same ratio as (a), 
ParentChecker only suggests the cross type of two 
homozygotes because the probability of a mating between 
two heterozygotes can be assumed to be very low for 
known inbred lines of self-pollinated species and can 
safely be ignored. In addition, the initial step for cross 
type Aa × Aa can also be regarded as a cross between 
two F1 individuals. Therefore, there is no fundamental 
difference between Aa × Aa and AA × aa, especially for 
advanced generations. 
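
The cross-type decision described above can be illustrated with a minimal Python sketch. It is not ParentChecker's code (the program itself is written in Delphi). All function names are hypothetical, the form of statistic (b) is assumed to be an ordinary Pearson goodness-of-fit statistic because the formula is not reproduced in this text, and the default expected proportion of 0.75 for the major homozygote is the limiting value under complete inbreeding after a homozygote × heterozygote cross, used here only for illustration.

```python
from fractions import Fraction
from math import erfc, sqrt


def selfing_proportions(generation):
    """Expected (AA, Aa, aa) proportions in generation F_n of an AA x aa cross.

    generation=2 is the F2. Each extra generation of selfing halves the
    heterozygotes and splits the lost half equally between the homozygotes.
    """
    het = Fraction(1, 2 ** (generation - 1))
    hom = (1 - het) / 2
    return hom, het, hom


def chi2_pvalue_1df(stat):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return erfc(sqrt(stat / 2.0))


def statistic_a(n_AA, n_aa):
    """Test (a): the two homozygote counts against a 1:1 ratio."""
    if n_AA + n_aa == 0:
        raise ValueError("no homozygous individuals observed")
    return (n_AA - n_aa) ** 2 / (n_AA + n_aa)


def statistic_b(n_major, n_rest, p_major):
    """Test (b), ASSUMED here to be a Pearson goodness-of-fit statistic."""
    n = n_major + n_rest
    expected_major = n * p_major
    expected_rest = n * (1.0 - p_major)
    return ((n_major - expected_major) ** 2 / expected_major
            + (n_rest - expected_rest) ** 2 / expected_rest)


def infer_cross_type(n_AA, n_Aa, n_aa, p_major=0.75):
    """Return the cross type with the smaller statistic, plus both statistics.

    p_major is the expected proportion of the major homozygote for a
    homozygote x heterozygote cross (illustrative default, see lead-in).
    """
    n_major = max(n_AA, n_aa)
    n_rest = n_AA + n_Aa + n_aa - n_major
    stat_a = statistic_a(n_AA, n_aa)
    stat_b = statistic_b(n_major, n_rest, p_major)
    label = ("homozygote x homozygote" if stat_a <= stat_b
             else "homozygote x heterozygote")
    return label, stat_a, stat_b


if __name__ == "__main__":
    print(selfing_proportions(3))        # F3 proportions as exact fractions
    print(infer_cross_type(48, 4, 48))   # balanced homozygotes
    print(infer_cross_type(74, 2, 24))   # one homozygote clearly in excess
```

In this sketch a marker with balanced homozygote counts is assigned to the homozygote × homozygote class, while a marker with roughly three times as many of one homozygote is assigned to the homozygote × heterozygote class. For an earlier generation, the caller would pass the generation-specific expected proportion from Table 1 as `p_major`.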

Linkage phase inference 

Consider three adjacent markers that are dispersed 
along a linkage group as follows: 

M1 ------ d1 ------ M2 ------ d2 ------ M3

During meiosis, the frequency of crossovers for each 
interval is assumed to be independent of other intervals, 
which means that the recombination frequency between 
two adjacent markers depends only on the interval size 
bracketed by the two markers and is not affected by 
other intervals. Therefore, only the genotypic informa- 
tion of the two markers is relevant for the inference of 
the linkage phase. This feature allows the use of a hid- 
den Markov model. 



Assume that the two alleles for marker M1 are A and 
a and the two alleles for marker M2 are B and b. The 
parental haplotype for generating the segregating popu- 
lation is either in coupling phase (AABB and aabb) or in 
repulsion phase (AAbb and aaBB). Since the linkage 
phase is a dichotomous event, we consider the coupling 
phase as status 1 and the repulsion phase as 0. If the 
hypothesis of coupling phase is rejected, the repulsion 
phase is accepted. 

Frequencies of the two-locus genotypes are listed in 
Table 2. Gametes that generate the individuals of the 
mapping population are grouped into four categories: (I) 
parental type × parental type; (II) parental type × 
recombinant type; (III) recombinant type × parental 
type; and (IV) recombinant type × recombinant type. 
Since the frequencies of types II and III are not affected 
by the linkage phase and the double heterozygote 
frequencies are identical in types I and IV, the four 
genotypes (AABB, aabb, AAbb, and aaBB) shown on the 
diagonal of Table 2 are used to infer the linkage phase 
of the two markers. 

Table 2 Genotypes formed by gametes and their 
frequencies under the coupling phase hypothesis. 

Although the linkage phase can be investigated by 
comparing the observed ratio of the parental genotypes 
to the recombinant genotypes with the theoretical ratio 
calculated from the length of the interval, a more 
convenient approach is to test directly whether the 
observed frequency of the parental genotypes is larger 
than that of the recombinant genotypes. The null 
hypothesis is $P_p = P_r = 0.5$, while the alternative 
hypothesis is $P_p > P_r$, 
where 

$$P_p = \frac{P(AABB + aabb)}{P(AABB + aabb) + P(AAbb + aaBB)} = \frac{(1 - r_1)^2}{1 - 2(1 - r_1) r_1}$$

and 

$$P_r = \frac{P(AAbb + aaBB)}{P(AABB + aabb) + P(AAbb + aaBB)} = \frac{r_1^2}{1 - 2(1 - r_1) r_1} \qquad (2)$$

The recombination frequency between M1 and M2 is 
denoted by $r_1$ and is calculated from $d_1$ using Haldane's 
[4] or Kosambi's [5] map function. The null hypothesis 
can be tested using $\chi^2 = (P_p - P_r)^2/(P_p + P_r) \sim \chi^2_{\nu=1}$. 
However, in practice, calculating the test statistic is 
unnecessary for linkage phase inference even if the 
interval size is relatively large. For example, if the 
distance between M1 and M2 is 30 cM, the theoretical 
values for $P_p$ and $P_r$ are 0.9218 and 0.0782, 
respectively. Suppose that there are only 50 individuals 
in total for the four genotypes in the diagonal of Table 2 
in the segregating population. Even if the observed 
numbers of individuals for AABB + aabb and AAbb + aaBB 
are 35 and 15, respectively, the statistical test is still 
significant because the p-value is 0.0006. Therefore, if 
the observed count for AABB + aabb is larger than that 
for AAbb + aaBB, that is, if the observed $P_p$ is larger 
than $P_r$, it is statistically safe to suggest that M1 and 
M2 are in coupling phase without calculating the test 
statistic. 
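
The quantities in this section can be checked with a short Python sketch. It converts a map distance to a recombination frequency with Haldane's or Kosambi's map function, evaluates the theoretical fractions of Eq. (2), and applies the simple counting rule for a pair of ABH-coded markers. The function names and the use of "A", "B" and "H" as genotype codes are assumptions made for illustration and are not taken from ParentChecker itself.

```python
from math import exp, tanh


def haldane_r(d_cm):
    """Recombination frequency from a map distance in cM (Haldane)."""
    return 0.5 * (1.0 - exp(-2.0 * d_cm / 100.0))


def kosambi_r(d_cm):
    """Recombination frequency from a map distance in cM (Kosambi)."""
    return 0.5 * tanh(2.0 * d_cm / 100.0)


def parental_recombinant_fractions(r):
    """Return (P_p, P_r) among the four double homozygotes, following Eq. (2)."""
    denom = 1.0 - 2.0 * (1.0 - r) * r
    return (1.0 - r) ** 2 / denom, r ** 2 / denom


def infer_phase(calls1, calls2):
    """Compare counts of concordant and discordant double homozygotes.

    calls1 and calls2 hold one genotype call per individual for two
    adjacent markers; anything other than "A" or "B" is ignored.
    """
    same = diff = 0
    for g1, g2 in zip(calls1, calls2):
        if g1 in ("A", "B") and g2 in ("A", "B"):
            if g1 == g2:
                same += 1
            else:
                diff += 1
    if same > diff:
        return "coupling"
    if diff > same:
        return "repulsion"
    return "undetermined"


if __name__ == "__main__":
    p_p, p_r = parental_recombinant_fractions(haldane_r(30.0))
    print(round(p_p, 4), round(p_r, 4))  # theoretical values at 30 cM
    print(infer_phase("AABBHA", "AABBAB"))
```

With Haldane's function, a 30 cM interval gives the theoretical values quoted in the text (0.9218 and 0.0782). The "undetermined" outcome for tied counts is a choice made for this sketch only.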

Results and discussion 

ParentChecker uses the segregation patterns of markers 
and a linkage map to infer the parental genotypes that 
produced the segregating population. The formulas 
implemented in the current release of ParentChecker 
rest on two assumptions: the molecular markers are 
codominant, and markers exhibiting distorted 
segregation have been removed from the dataset. Users 
are strongly advised to use the built-in functions of 
ParentChecker to remove unsuitable markers from the 
genotypic data before exporting the final outputs. 
Although the fundamentals of phase inference in linkage 
analysis have been discussed in detail [6-9], the strategy 
employed in ParentChecker for handling phase issues is 
slightly different from other approaches. To infer the 
linkage phase, we used the chi-square test in an intuitive 
way instead of a maximum likelihood method 
implemented by an expectation-maximization algorithm. 
This approach requires only a minimal amount of 
calculation, which is helpful for handling high-density 
SNP data. Furthermore, it offers a convenient way to 
determine the correct linkage phase at a high level of 
statistical confidence without requiring actual 
calculation of test statistics. 

For SNP data, a recommended workflow for 
ParentChecker would be to load data in ACGT format and 
use the output information (inferred parent) from 
ParentChecker for subsequent analyses such as building 
improved maps and QTL detection. For SNP data 
supplied in ACGT format, ParentChecker can generate 
output in ABH format suitable for mapping and QTL 
detection. Furthermore, ParentChecker can directly 
export input files for popular genetic software packages 
including Flapjack [10], GGT [11], MapQTL [12], 
PowerMarker [13], Structure [14], and Tassel [15]. For 
other types of molecular marker data (e.g., SSRs) that 
are coded in ABH format, ParentChecker can be used 
to automatically correct linkage phase errors, which may 
be caused by missing values and genotyping errors [16] 
in parental genotypic data. Unlike JoinMap [17], 
Flapjack [10] and GGT [11], ParentChecker 
automatically recodes the genotypic data according to 
the linkage phases it inferred, so user intervention is 
not necessary. 
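
The automatic recoding of ABH data can be pictured with the following Python sketch. It is an illustration under stated assumptions rather than ParentChecker's actual procedure: markers are assumed to be supplied in map order, the first marker is taken as the reference orientation, and every later marker that appears to be in repulsion with its already recoded predecessor has its A and B calls swapped. The function names and the data layout are hypothetical.

```python
FLIP = {"A": "B", "B": "A"}  # heterozygous (H) and missing calls stay unchanged


def count_pairs(calls1, calls2):
    """Count concordant and discordant double homozygotes across individuals."""
    same = diff = 0
    for g1, g2 in zip(calls1, calls2):
        if g1 in FLIP and g2 in FLIP:
            if g1 == g2:
                same += 1
            else:
                diff += 1
    return same, diff


def recode_linkage_group(markers):
    """Recode markers so that adjacent markers are in coupling phase.

    markers: list of (name, calls) tuples in map order, where calls holds
    one ABH genotype code per individual. Returns a new list of tuples.
    """
    recoded = []
    previous = None
    for name, calls in markers:
        calls = list(calls)
        if previous is not None:
            same, diff = count_pairs(previous, calls)
            if diff > same:
                # Apparent repulsion with the previous marker: swap A and B.
                calls = [FLIP.get(g, g) for g in calls]
        recoded.append((name, calls))
        previous = calls
    return recoded


if __name__ == "__main__":
    group = [("m1", "AABBH"), ("m2", "BBAAH"), ("m3", "BBAAA")]
    for name, calls in recode_linkage_group(group):
        print(name, "".join(calls))
```

Because each marker is compared with the recoded version of its neighbour, a single pass along the linkage group is enough in this sketch. The orientation of the whole group still depends on the first marker, which is why reliable parental calls for at least some markers remain useful.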

The input data format for ParentChecker is flexible. 
ParentChecker can take data directly from tab-delimited 
text files or import data from an Excel clipboard. The 
order of the markers in the genotype file does not have 
to match the order of the markers in the map as long as 
the marker names are consistent between the two files. 
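
Matching genotype rows to map positions by marker name could look like the Python sketch below. The file layouts are assumptions made for illustration only (a genotype row is a marker name followed by one call per individual, and a map row is a marker name, a linkage group and a position in cM); ParentChecker's real input layout is not described in this text.

```python
import csv
import io


def read_tab_delimited(text):
    """Parse tab-delimited text into a list of non-empty rows."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def order_genotypes_by_map(genotype_rows, map_rows):
    """Return (marker, calls) pairs sorted by linkage group and position.

    genotype_rows: rows of the form [marker, call_1, call_2, ...]
    map_rows:      rows of the form [marker, linkage_group, position_cM]
    Both layouts are hypothetical.
    """
    calls_by_name = {row[0]: row[1:] for row in genotype_rows}
    ordered_map = sorted(map_rows, key=lambda row: (row[1], float(row[2])))
    missing = [row[0] for row in ordered_map if row[0] not in calls_by_name]
    if missing:
        raise KeyError("markers on the map without genotypes: %s" % missing)
    return [(row[0], calls_by_name[row[0]]) for row in ordered_map]


if __name__ == "__main__":
    genotypes = read_tab_delimited("m2\tA\tB\nm1\tB\tB\n")
    linkage_map = read_tab_delimited("m1\tLG1\t0.0\nm2\tLG1\t5.2\n")
    print(order_genotypes_by_map(genotypes, linkage_map))
```

Looking markers up by name, rather than relying on row order, is what allows the two files to list markers in different orders.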

ParentChecker efficiently improves mapping datasets 
for cases where parental information is incomplete. The 
observation of missing parental haplotypes in the 
development of a consensus map of cowpea [18] spurred 
the development of ParentChecker. The consensus map 
was constructed by merging individual maps made from 
11 RIL and 2 F4 mapping populations that had been 
genotyped with the Illumina 1536-SNP GoldenGate Assay 
[19]. Nine of the 11 RILs and both F4 populations had 
at least one case of missing parental genotype 
information, with the number of missing parent data 
totalling 310 instances and ranging from 1 to 107 per 
mapping population (Table 3). An iterative process was 
employed, which included detecting suspicious linkage 
phases using JoinMap4, correcting the linkage phase 
errors manually, and re-checking the parental phase 
visually with Flapjack. This tedious one-by-one process 
produces correct phase designations; however, it 
requires user-based decisions, which are time-consuming 
and can be subjective. Of the 310 additional SNP data 
points where phase was assigned arbitrarily, 148, or 
approximately half, required phase reversal. Using the 
manual method with JoinMap4, potential linkage maps 
had to be generated each time a marker exhibited 
characteristics of an uncertain parental phase. These 
potential maps were then checked within JoinMap4 [17] 
and visually through Flapjack [10] and re-mapped, if 
necessary. The process required numerous iterations 
until a satisfactory fit was obtained and parental phase 
finally assigned. ParentChecker is able to accomplish 
this task in less than 2 minutes. Given the large datasets 
currently being generated in many crops by 
high-throughput genotyping platforms, there is a need 
for the automation of parental inference and data export 
flexibility provided by ParentChecker. 

Table 3 An excerpt from Lucas et al. 2011 [18] 

Conclusions 

ParentChecker is an automated tool designed to effi- 
ciently infer parental genotypes for improved map reso- 
lution. It also helps researchers to recode genotypic data 
to match the underlying linkage phase of RIL 
populations. 

Availability and requirements 

Project name: ParentChecker 

Project home page: http://statgen.ucr.edu/software.html 

Operating system(s): Windows XP/7 
Programming language: Delphi 
License: Freeware 

Any restrictions to use by non-academics: None 
Additional materials: Two sample datasets from our 
cowpea project are provided in the ParentChecker pack- 
age for testing and demonstration purposes. 